Categorical data

Our first steps in working with and modeling categorical data in R

Welcome

Introduction!

Welcome to our tenth tutorial for the Statistics II: Statistical Modeling & Causal Inference (with R) course.

During this week’s lecture you were introduced to categorical data and different approaches to model it.

In this lab session we will:

  • Introduce the logic of working with categorical variables using {forcats}
  • Learn how to perform binary logistic regression in R
  • Learn how to interpret and visualize results with {marginaleffects}

Evaluating public support for wind turbines and navigating green politics in Burovia

The small but ambitious country of Burovia has pledged to achieve net-zero carbon emissions by 2040. As part of this commitment, the Ministry of Sustainable Transformation and Electric Revolution (MISTER) has announced a nationwide expansion of wind energy infrastructure, with dozens of new wind farms scheduled for construction in the next three years.

While Burovians overwhelmingly support renewable energy in principle, public resistance has surged whenever specific locations are proposed—particularly near residential areas. Citizens express concern about noise pollution, impact on property values, and visual aesthetics. This disconnect between national-level support and local opposition has been dubbed the “green energy paradox” by the press—and is closely tied to what international observers call Not In My Backyard (NIMBY) attitudes.

MISTER has commissioned a study to better understand what drives public support or opposition to wind turbine projects at the local level. You are hired as a policy evaluation consultant for this initiative.

Can the wind of change blow through Burovia — or will it get stuck in a swirl of local resistance?


Packages

set.seed(42) #for consistent results

library(dplyr) # to wrangle our data
library(ggplot2) # to render our graphs
library(readr) # for loading the .csv data
library(janitor) # for data management and tabyl()
library(forcats) # to work with categorical data
library(kableExtra) # to render better formatted tables
library(modelsummary) # for formatting your model output

library(marginaleffects) # to aid interpretations and visual depictions

Exploring our set-up

wind_df <- readr::read_csv("https://raw.githubusercontent.com/seramirezruiz/2025-spring-stats2/refs/heads/main/data/wind_df.csv") # simulated data
names(wind_df) # to check the names of the variables in our data
## [1] "support_wind_turbines" "proximity_to_proposed" "env_concern"          
## [4] "trust_in_government"   "left_right_ideology"   "age"                  
## [7] "income"                "educ"

Our data set wind_df contains the following information:

  • support_wind_turbines: A binary variable indicating whether the respondent supports building wind turbines within 5km of residential areas in Burovia (1 = Yes, 0 = No)
  • proximity_to_proposed: A binary variable indicating whether the respondent lives near a proposed wind turbine site (1 = Lives nearby, 0 = Does not live nearby)
  • env_concern: A numeric categorical variable indicating the respondent’s level of environmental concern (1 = Low, 2 = Medium, 3 = High)
  • trust_in_government: A binary variable indicating trust in the Burovian government (1 = Trusts, 0 = Does not trust)
  • left_right_ideology: A binary variable indicating political ideology (1 = Right-leaning, 0 = Left-leaning)
  • age: A numeric variable indicating the age of the respondent (in years)
  • income: A numeric variable for respondent income (measured in Burovian Dollars)
  • educ: A numeric categorical variable indicating education level (1 = High school or less, 2 = University degree, 3 = Postgraduate degree)

Variable types

Before diving into regression types, let’s briefly review variable types:

  • Quantitative variables (numerical values) can be:
    • Discrete: countable values (e.g., number of wind turbines in district)
    • Continuous: any value within a range (e.g., income)
  • Qualitative (categorical) variables (non-numerical) can be:
    • Nominal: categories without order (e.g., country of nationality)
    • Ordinal: categories with a meaningful order (e.g., educ)

A binary variable is a special case of nominal with only two outcomes (e.g., yes/no).


Working with categorical data with {forcats}

In R, factors are used to represent categorical variables. A factor is a compound object made up of two components:

  • A vector storing the actual category values
  • A set of levels defining the possible categories (levels can also carry an ordering)

# creating a factor
countries_of_nationality <- factor(
  c("Italy", "France", "Germany", "Germany", "Austria")) 

countries_of_nationality
## [1] Italy   France  Germany Germany Austria
## Levels: Austria France Germany Italy

# creating a factor with ordered levels (see what happens when you remove ordered = TRUE)
sizes <- factor(
  c('sm', 'md', 'lg', 'sm', 'md'),
  levels = c('sm', 'md', 'lg'),
  ordered = TRUE)

sizes
## [1] sm md lg sm md
## Levels: sm < md < lg

While R has a native way to work with factors, the {forcats} package from the tidyverse provides a useful and flexible alternative. {forcats} simplifies working with factors, especially when dealing with:

  • Reordering levels: Functions like fct_relevel() allow you to reorder factor levels easily, useful for controlling the order of categorical variables in analysis or plotting.
  • Recoding levels: fct_recode() allows you to recode levels, merging or renaming categories without needing to manually manipulate the data.
  • Handling missing values: {forcats} lets you turn missing values into an explicit factor level with fct_na_value_to_level() (the successor to the now-deprecated fct_explicit_na()).
  • Level manipulation: With functions like fct_drop(), it is simple to drop unused levels, which can be crucial when cleaning data.
  • Summary and visualization: {forcats} integrates seamlessly with ggplot2 and dplyr to summarize and visualize categorical data efficiently.

Some of the most useful functions are:

  • forcats::fct_reorder(): Reordering a factor by another variable.
  • forcats::fct_infreq(): Reordering a factor by the frequency of values.
  • forcats::fct_relevel(): Changing the order of a factor by hand.
  • forcats::fct_lump(): Collapsing the least/most frequent values of a factor into “other”.

We will use some of them later today.
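As a quick illustration, here is a minimal sketch of three of these helpers on a toy factor (the country values are invented; fct_lump_n() is one of the more explicit successors to fct_lump()):

```r
library(forcats)

# a toy factor with repeated values
f <- factor(c("Italy", "France", "Germany", "Germany", "Austria", "Italy", "Italy"))

fct_infreq(f)              # reorder levels by frequency ("Italy" becomes the first level)
fct_relevel(f, "Germany")  # move "Germany" to the front by hand
fct_lump_n(f, n = 1)       # keep the single most frequent level, lump the rest into "Other"
```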

You can take a look at the chapter on Factors in the R for Data Science book and the {forcats} vignette here.


Let’s recode our data into factors

categorical_variables <- c("support_wind_turbines","proximity_to_proposed",
                           "env_concern","trust_in_government",
                           "left_right_ideology", "educ")

wind_factors_df <- wind_df %>% 
  dplyr::mutate(dplyr::across(dplyr::all_of(categorical_variables), as.factor)) %>% # turn into factors
  dplyr::mutate(
    educ = forcats::fct_recode(
      educ,
      "High school or less" = "1",
      "University degree" = "2",
      "Postgraduate degree" = "3"
    ),
    env_concern = forcats::fct_recode(
      env_concern,
      "Low" = "1",
      "Medium" = "2",
      "High" = "3"
    ),
    income_1000 = income/1000) #income in 1000s

# wind_factors_df$env_concern # uncomment to inspect the recoded factor

Note 💡

When running logistic regression in R, it is generally recommended that all the qualitative variables are transformed into factors, since models will treat numeric inputs as continuous constructs by default.
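To see why this matters, here is a small sketch with simulated toy data (the variables are invented for illustration): treating a 1-2-3 education code as numeric estimates a single slope along that scale, while a factor estimates a separate contrast for each level.

```r
set.seed(1)
toy <- data.frame(
  y    = rbinom(200, 1, 0.5),
  educ = sample(1:3, 200, replace = TRUE)
)

# numeric predictor: one slope for a 1-2-3 "scale"
m_numeric <- glm(y ~ educ, family = binomial, data = toy)

# factor predictor: one contrast each for levels 2 and 3 vs. level 1
m_factor  <- glm(y ~ as.factor(educ), family = binomial, data = toy)

length(coef(m_numeric)) # 2: intercept + one slope
length(coef(m_factor))  # 3: intercept + two level contrasts
```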


Logistic Regression in R

Regression is a key tool in statistics for exploring relationships between variables. Two of the most common types are linear and logistic regression. In most cases, we use:

  • Linear regression when the dependent variable is quantitative.
  • Logistic regression when the dependent variable is qualitative.

Types of regression (so far)

Linear regression

  • Simple linear regression: one independent variable, continuous dependent variable.
  • Multiple linear regression: two or more independent variables, continuous dependent variable.

Logistic regression

  • Binary logistic regression: binary dependent variable (two outcomes).
  • Multinomial logistic regression: nominal dependent variable with 3+ unordered categories.
  • Ordinal logistic regression: ordinal dependent variable with 3+ ordered categories.

Logistic and probit regressions, which you saw during this week’s lecture, are both part of Generalized Linear Models (GLM), which extend linear models to handle non-continuous outcomes and non-normal residuals. In the labs we will focus on logistic regression, but the general syntax to set up the models and logic of extracting predictive margins with {marginaleffects} can be applied to probit models.
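For instance, switching from logit to probit only requires changing the link function inside the family argument. A minimal sketch with simulated toy data (invented for illustration):

```r
set.seed(1)
df <- data.frame(x = rnorm(300))
df$y <- rbinom(300, 1, pnorm(0.8 * df$x))  # toy binary outcome

logit_m  <- glm(y ~ x, family = binomial,                  data = df)
probit_m <- glm(y ~ x, family = binomial(link = "probit"), data = df)

c(family(logit_m)$link, family(probit_m)$link)  # "logit" "probit"
```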


Let’s turn to the task at hand

The Ministry has asked you to analyze these data from a nationally representative survey of 20,000 Burovian residents, focusing on the question:

Do you support the construction of wind turbines within 5 kilometers of residential areas?

The ministry is particularly interested in:

  • Whether opposition is driven more by ideology, self-interest, or misinformation
  • How support varies based on income, proximity, environmental beliefs, and trust in government
  • Whether seemingly “pro-environmental” individuals are still susceptible to NIMBY tendencies

Your findings will inform public communication strategy, siting policies, and the design of community engagement programs.


Logistic regression with one independent variable

Say we want to examine the relationship between income and the likelihood of supporting wind turbine projects.

A binary logistic regression is appropriate when the dependent variable is categorical with two levels (e.g., support vs. no support), and the independent variable(s) can be either categorical or continuous.

In this case, the dependent variable is whether a resident supports the wind turbines (coded as 1 for support and 0 for no support), and the independent variable is income (a continuous variable).

In R, logistic regression can be done using the glm() function, specifying the argument family = binomial. The syntax is similar to other formulas we have used in the past:

glm(outcome_of_interest ~ predictor,
family = binomial,
data = data_frame)

Let’s build our model

income_model <- 
  glm(support_wind_turbines ~ income_1000,
      family = binomial,
      data = wind_factors_df
  )

modelsummary::modelsummary(income_model,
                           stars = c('*' = .1, '**' = .05, '***' = .01),
                           statistic = 'conf.int')
(1)
* p < 0.1, ** p < 0.05, *** p < 0.01
(Intercept) -0.759***
[-0.836, -0.683]
income_1000 0.039***
[0.038, 0.041]
Num.Obs. 20000
AIC 18986.1
BIC 19001.9
Log.Lik. -9491.031
F 2415.373
RMSE 0.39

How to interpret our coefficients? 🤔

The (Intercept) (\(β_0\)) is the constant term, and the income_1000 coefficient (\(β_1\)) captures the relationship between income and the likelihood of supporting wind turbines.

  • The estimate for income_1000: This coefficient indicates how much the log-odds of supporting wind turbines changes with a one-unit increase in income (in thousands).

Interpreting log-odds is not the most intuitive. A general rule to keep in mind is:

  • when \(β_1 = 0\), \(X\) and \(Y\) are independent,
  • when \(β_1 > 0\), the probability that \(Y=1\) increases with \(X\),
  • when \(β_1 < 0\), the probability that \(Y=1\) decreases with \(X\).

When \(β_1\) is positive, an increase in income increases the probability of supporting wind turbines.

Turning it into odds ratios: We can also shift to odds ratios to quantify this relationship and interpret it as follows:

\[\exp(β_1) = \exp(0.039) = 1.0397\]

Based on this result, we can say that an extra 1,000 dollars of income multiplies the odds of supporting by a factor of ≈1.04, that is, roughly a 4% increase in the odds.


In R:

exp(0.039)
## [1] 1.03977

Predicting from our models

Estimating the relationship between variables is one reason for building models. Another goal is to predict the dependent variable based on newly observed values of the independent variable(s). This can be done with the predict() function.

Suppose we would like to predict the probability of supporting the building of wind turbines within 5km of residential areas for an individual with $70,000 of income:

predict(income_model,
        newdata = data.frame(income_1000 = 70),
        type = "response"
)
##         1 
## 0.8805791

We would predict that a Burovian with $70,000 of income would have an 88% chance of supporting the building of wind turbines within 5km of residential areas.
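Under the hood, predict() with type = "response" simply applies the inverse logit to the linear predictor. We can sketch the arithmetic by hand using the rounded coefficients from the table above (so the result differs slightly from the full-precision 0.8806):

```r
b0 <- -0.759  # intercept (rounded from the model table)
b1 <-  0.039  # income_1000 coefficient (rounded)

plogis(b0 + b1 * 70)  # inverse logit of the linear predictor, approx. 0.878
```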

We can also construct a confidence interval for this prediction by adding the se.fit = TRUE argument to the predict() call:

pred_income <- predict(income_model,
                       newdata = data.frame(income_1000 = 70),
                       type = "response",
                       se.fit = TRUE
)

pred_income$fit
##         1 
## 0.8805791
## creating CIs
lower <- pred_income$fit - (qnorm(0.975) * pred_income$se.fit)
upper <- pred_income$fit + (qnorm(0.975) * pred_income$se.fit)
c(lower, upper)
##         1         1 
## 0.8747801 0.8863782
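One caveat: intervals built directly on the response scale can, in other settings, stray outside [0, 1]. A common refinement is to build the interval on the link (log-odds) scale and then map it back with plogis(). A sketch using a simulated toy model, so that it runs standalone:

```r
set.seed(1)
toy <- data.frame(x = rnorm(500))
toy$y <- rbinom(500, 1, plogis(0.5 * toy$x))
m <- glm(y ~ x, family = binomial, data = toy)

# CI on the link scale, then transformed back to probabilities
p_link <- predict(m, newdata = data.frame(x = 2), type = "link", se.fit = TRUE)
ci <- plogis(p_link$fit + c(-1, 1) * qnorm(0.975) * p_link$se.fit)
ci  # guaranteed to stay within [0, 1]
```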

Visualizing the results of the model

We can also use the {marginaleffects} package to provide a visual depiction of our model. We can use the function marginaleffects::plot_predictions() to plot our conditional predictions.

marginaleffects::plot_predictions(income_model, condition = "income_1000") 

It is wrapped around {ggplot2}, so we can use some of the syntax we learned in the past to go beyond the defaults. We can also show where our prediction for $70K lies:

marginaleffects::plot_predictions(income_model, condition = "income_1000") +
  geom_point(x = 70, y = 0.8805791, color = "#cc0065", size = 2) + 
  geom_vline(xintercept = 70, alpha = 0.5, color = "#cc0065", linetype = "dashed") + # this is our prediction for 70
  theme_minimal() +
  labs(x = "Income (thousands)",
       y = "Predicted probability of support\nfor wind turbines")


Logistic regression with multiple independent variables

The interpretation of the coefficients in multivariable logistic regression is similar to the interpretation in univariable regression, except that this time it estimates the multiplicative change in the odds in favor of \(Y = 1\) when \(X\) increases by 1 unit, while the other independent variables remain unchanged.

For this illustration, suppose we would like to estimate the relationship between supporting wind turbine projects and all variables present in the data frame:

multivariate_model <- glm(support_wind_turbines ~ proximity_to_proposed + env_concern +
                            trust_in_government + left_right_ideology + age + income_1000 + educ,
                          family = binomial,
                          data = wind_factors_df)

modelsummary::modelsummary(multivariate_model,
                           stars = c('*' = .1, '**' = .05, '***' = .01),
                           statistic = 'conf.int')
(1)
* p < 0.1, ** p < 0.05, *** p < 0.01
(Intercept) -0.862***
[-1.024, -0.700]
proximity_to_proposed1 -0.783***
[-0.866, -0.699]
env_concernMedium 1.198***
[1.106, 1.290]
env_concernHigh 2.326***
[2.212, 2.441]
trust_in_government1 0.526***
[0.447, 0.605]
left_right_ideology1 -0.652***
[-0.731, -0.572]
age -0.021***
[-0.023, -0.019]
income_1000 0.049***
[0.047, 0.051]
educUniversity degree 0.291***
[0.204, 0.377]
educPostgraduate degree 0.635***
[0.524, 0.747]
Num.Obs. 20000
AIC 15689.4
BIC 15768.4
Log.Lik. -7834.675
F 421.034
RMSE 0.35

As in the previous example with one predictor, it is easier to interpret these relationships through odds ratios (ORs). This time, we also print the 95% CI of each OR in addition to the OR itself (rounded to 3 decimals), so that we can easily see which ones are significantly different from 1:

round(exp(cbind(OR = coef(multivariate_model), confint(multivariate_model))), 3) %>% 
  knitr::kable() %>% 
  kableExtra::kable_styling()
OR 2.5 % 97.5 %
(Intercept) 0.422 0.359 0.496
proximity_to_proposed1 0.457 0.420 0.497
env_concernMedium 3.313 3.023 3.632
env_concernHigh 10.238 9.138 11.490
trust_in_government1 1.692 1.563 1.831
left_right_ideology1 0.521 0.481 0.564
age 0.979 0.977 0.981
income_1000 1.051 1.049 1.052
educUniversity degree 1.337 1.227 1.459
educPostgraduate degree 1.887 1.689 2.110

From the ORs and their 95% CIs computed above, we conclude that:

  • Proximity to the proposed site: individuals who live close to the proposed site have odds of supporting the project that are 0.46 times those of individuals who do not live nearby, all else being equal. This indicates a negative association between proximity and support.

  • Environmental concern: individuals with medium environmental concern have odds of support that are 3.31 times those with low concern, while individuals with high environmental concern have odds 10.24 times higher than those with low concern, all else being equal. These strong and significant associations suggest that higher environmental concern is positively associated with support for the project.

  • Trust in government: individuals who trust the government have odds of supporting the project that are 1.69 times higher than those who do not, all else being equal.

  • Left-right ideology: individuals who lean right on the ideological spectrum have odds of support that are 0.52 times those who lean left, indicating a negative relationship between right-leaning ideology and support for the project.

  • Age: for each additional year of age, the odds of supporting the project are multiplied by 0.98, suggesting that older individuals are slightly less likely to support the project, all else being equal.

  • Income: for each additional 1,000 Burovian dollars of income, the odds of supporting the project increase by a factor of 1.05. This positive and significant association indicates that wealthier individuals are more likely to support the project.

  • Education: individuals with a university degree have odds of support that are 1.34 times those of individuals with a lower level of education, while those with a postgraduate degree have odds that are 1.89 times higher, all else being equal.

  • Intercept: we refrain from interpreting the intercept, as it represents the odds of support for a hypothetical individual with all predictors set to zero or to the reference category, which is not substantively meaningful in this context.


Interactions

In the previous sections, we omitted potential interaction effects.

As we learned last week, an interaction occurs when the relationship between an independent variable and the dependent variable depends on the value taken by another independent variable. In other words, if the effect of one predictor on the outcome changes depending on the level of another predictor, then there is evidence of an interaction.

Based on what we know from Burovia, we hypothesize that the effect of proximity to a proposed development on support for the development may vary depending on a respondent’s level of environmental concern.

Why might environmental concern and proximity interact?

From the perspective of environmental sociology and the literature on Not In My Back Yard (NIMBY) dynamics—something that the Ministry itself seems to suspect might influence public reactions—prior research suggests that people’s support or opposition to local developments is shaped by both their broader values and worldviews, such as environmental concern, and their geographic proximity to the proposed site.

While environmental concern might lead people to support renewable energy in principle, this support can become conditional when the development is proposed near their homes. This is a classic NIMBY scenario: a person may be in favor of wind energy as a policy, but not want turbines visible from their backyard.

  • Those high in environmental concern may be more likely to support wind energy generally. But when the project is close to them, they may become skeptical—worrying about noise, visual impacts, or ecological disruption to their immediate surroundings. Their support is principled, but may falter when the costs feel personal.

  • Those low in environmental concern may be less engaged either way. Proximity might not change their views significantly, or they may focus more on property rights, aesthetics, or distrust of government projects regardless of environmental reasoning.

This leads to a plausible interaction hypothesis:

The effect of proximity to a proposed development on support for the construction of wind turbines within 5km of residential areas depends on a person’s level of environmental concern.

interaction_model <- glm(support_wind_turbines ~ proximity_to_proposed * env_concern +
                            trust_in_government + left_right_ideology + age + income_1000 + educ,
                          family = binomial,
                          data = wind_factors_df)

modelsummary::modelsummary(interaction_model,
                           stars = c('*' = .1, '**' = .05, '***' = .01),
                           statistic = 'conf.int')
(1)
* p < 0.1, ** p < 0.05, *** p < 0.01
(Intercept) -0.857***
[-1.021, -0.694]
proximity_to_proposed1 -0.794***
[-0.910, -0.677]
env_concernMedium 1.178***
[1.067, 1.290]
env_concernHigh 2.341***
[2.200, 2.485]
trust_in_government1 0.525***
[0.447, 0.605]
left_right_ideology1 -0.652***
[-0.732, -0.572]
age -0.021***
[-0.023, -0.019]
income_1000 0.049***
[0.047, 0.051]
educUniversity degree 0.291***
[0.204, 0.378]
educPostgraduate degree 0.635***
[0.524, 0.746]
proximity_to_proposed1 × env_concernMedium 0.056
[-0.132, 0.245]
proximity_to_proposed1 × env_concernHigh -0.036
[-0.263, 0.191]
Num.Obs. 20000
AIC 15692.7
BIC 15787.6
Log.Lik. -7834.370
F 344.348
RMSE 0.35

Getting predictive margins

marginaleffects::avg_predictions(interaction_model, by = c("env_concern", "proximity_to_proposed"))
## 
##  env_concern proximity_to_proposed Estimate Std. Error     z Pr(>|z|)   S 2.5 %
##       Low                        0    0.667    0.00535 124.6   <0.001 Inf 0.656
##       Low                        1    0.529    0.00860  61.6   <0.001 Inf 0.513
##       Medium                     0    0.823    0.00525 156.8   <0.001 Inf 0.812
##       Medium                     1    0.734    0.00892  82.2   <0.001 Inf 0.716
##       High                       0    0.926    0.00381 243.0   <0.001 Inf 0.918
##       High                       1    0.866    0.00735 117.8   <0.001 Inf 0.852
##  97.5 %
##   0.677
##   0.546
##   0.833
##   0.751
##   0.933
##   0.880
## 
## Type:  response 
## Columns: env_concern, proximity_to_proposed, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
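To probe the interaction more directly, one option (a sketch, reusing the interaction_model fitted above) is marginaleffects::avg_comparisons(), which averages the predicted change in support associated with living nearby, separately within each level of environmental concern:

```r
marginaleffects::avg_comparisons(
  interaction_model,
  variables = "proximity_to_proposed",  # contrast: lives nearby vs. does not
  by = "env_concern"                    # one average contrast per concern level
)
```

If the interaction mattered, these three contrasts would differ noticeably; the near-zero interaction coefficients above suggest they will be similar.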
marginaleffects::plot_predictions(interaction_model, by = c("env_concern", "proximity_to_proposed")) +
  labs(x = "Environmental concern level",
       y = "Predicted probability of support\nfor wind turbines",
       color = "Proximity to proposed site") +
  scale_color_manual(labels = c("0" = "Does not live nearby", 
                                "1" = "Lives nearby"),
                     values = c("#5C9184","#550055")) +
  theme_minimal() +
  theme(legend.position = "bottom")


Drafting some brief recommendations

After analyzing the data from the national survey, you report back to the Ministry. You find that, on average, support for the wind turbine project is moderately high—but not uniform. Notably, support tends to decline among individuals who live within 5 kilometers of a proposed site, pointing to a classic NIMBY effect even for seemingly “pro-environmental” individuals.

You explain that most Burovians may support renewable energy in principle, but become more skeptical when the costs feel personal. This finding suggests that even pro-environmental values may not be sufficient to ensure local support in the absence of targeted engagement.


Acknowledgements

This tutorial is inspired by some sections of Antoine Soetewey’s Binary logistic regression in R. You should also take a look at that tutorial if you are interested in prediction and in assessing model quality for that purpose.

This script was drafted by Carolina Díaz Suárez and Sebastian Ramirez-Ruiz, with contributions by Sofía García-Durrer, and William Fernandez.